Purpose: This project analyzes several trends in the Apple App Store. I chose this data because I’m interested in mobile app development and currently have an application on the App Store. My application falls under music and social media, so the analysis focuses most heavily on those two genres.
# clean up workspace environment
rm(list = ls())
#Packages
library(mosaic)
library(tidyverse)
library(DataComputing)
library(party)
The chunk below loads the two datasets about the Apple App Store.
#This dataset contains a number of analytics about over 7000 applications on the App Store.
#Information includes rating, price, genre, etc.
dataset_1 <- 'C:/Users/angel/Dropbox/Penn State/STAT 184/Project/App_Store_Analysis/AppleStore.csv'
App_Store_Data <- read.csv(file = dataset_1, header=TRUE, sep=",")
#This dataset holds the App Store descriptions for the applications (judging by the file name).
dataset_2 <- 'C:/Users/angel/Dropbox/Penn State/STAT 184/Project/App_Store_Analysis/appleStore_description.csv'
App_Store_Description_Data <- read.csv(file = dataset_2, header=TRUE, sep=",")
App_Store_Data %>%
sample_n(size = 10)
App_Store_Description_Data %>%
sample_n(size = 10)
The first thing I wanted to do with this dataset was gather some general statistics about the apps in the App Store.
#This code displays the number of apps, the mean app size (in bytes), the mean user rating, and the mean number of supported languages.
App_Store_Data %>%
summarise(num_apps = n(),
mean_app_size = mean(size_bytes),
mean_app_rating = mean(user_rating),
mean_supported_languages = mean(lang.num))
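Because `size_bytes` is stored in bytes, the mean size is hard to read at a glance. The following is a minimal sketch of the same summary with the size reported in megabytes. It assumes a small made-up data frame (`toy_apps`, a hypothetical stand-in for `App_Store_Data` with invented values) so that it runs on its own.

```r
library(dplyr)

# Hypothetical stand-in for App_Store_Data; all values are made up for illustration.
toy_apps <- data.frame(
  size_bytes  = c(100e6, 50e6, 150e6),
  user_rating = c(4.5, 3.0, 4.0),
  lang.num    = c(1, 10, 4)
)

# Same summary as above, with the mean size converted from bytes to megabytes
# (using 1 MB = 10^6 bytes).
toy_summary <- toy_apps %>%
  summarise(num_apps = n(),
            mean_app_size_mb = mean(size_bytes) / 1e6,
            mean_app_rating = mean(user_rating),
            mean_supported_languages = mean(lang.num))
toy_summary
```

Swapping `toy_apps` for `App_Store_Data` should give the same table for the real data, provided the column names match.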
The following is a general breakdown of the app categories in the App Store. Below is a table with the number of apps in each category, followed by a bar chart of the same counts. The plot makes it easier to see how diverse the App Store is. Based on this data, it’s clear that games make up a very large share of the apps on the App Store.
App_Categories <- App_Store_Data %>%
  group_by(prime_genre) %>%
  summarise(num_on_store = n())  # number of apps in each genre; used by the plot below
App_Categories
ggplot(data = App_Categories, aes(x = prime_genre, y = num_on_store, fill = prime_genre)) +
  geom_bar(stat = 'identity', position = 'stack', width = .9) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5)) +
  xlab('App Category') + ylab('Number on Store')
#layer1 <- geom_point(data = App_Store_Data, aes(shape = user_rating))
#layer2 <- geom_point(data = App_Store_Data, aes(shape = user_rating_ver))
#App_Store_Data %>%
# ggplot(aes(x = id, y = user_rating))+geom_point()
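To make the genre breakdown easier to compare, the bars can be sorted by count and each genre's share of the store can be computed. This is only a sketch: `toy_categories` is a hypothetical table with invented counts that mirrors the columns of `App_Categories`.

```r
library(dplyr)
library(ggplot2)

# Hypothetical genre counts; the numbers are made up for illustration.
toy_categories <- data.frame(
  prime_genre  = c("Games", "Music", "Social Networking"),
  num_on_store = c(30, 12, 8)
) %>%
  mutate(share = num_on_store / sum(num_on_store))  # fraction of all apps in each genre

# reorder() sorts the x axis by descending count instead of alphabetically.
genre_plot <- ggplot(toy_categories,
                     aes(x = reorder(prime_genre, -num_on_store), y = num_on_store)) +
  geom_col() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5)) +
  xlab('App Category') + ylab('Number on Store')
genre_plot
```

The `share` column puts a number on how dominant a genre is, which the raw counts only suggest.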
Specific classification for music and social media
#music data
Music_App_Data <-
App_Store_Data %>%
filter(prime_genre == 'Music')
#social network data
Social_Network_App_Data <-
App_Store_Data %>%
filter(prime_genre == 'Social Networking')
#get highest rated music app names
#get highest rated social media app names
#compare ratings of free vs paid for each.
#Use ML to take the user_rating_ver (rating of current app version) and predict the user_rating (overall app version) for each category
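The planned steps listed above could be sketched roughly as follows. This is not the finished analysis: `toy_apps` is a hypothetical stand-in with invented values, the column names `track_name` and `price` are assumptions about the dataset, and a plain linear regression (`lm`) stands in for the "ML" step as a simple baseline.

```r
library(dplyr)

# Hypothetical stand-in for App_Store_Data; track_name, price, and all values are assumed.
toy_apps <- data.frame(
  track_name      = c("A", "B", "C", "D", "E", "F"),
  prime_genre     = c("Music", "Music", "Music",
                      "Social Networking", "Social Networking", "Social Networking"),
  price           = c(0, 0.99, 0, 0, 2.99, 0),
  user_rating     = c(4.5, 4.0, 3.0, 3.5, 4.5, 2.0),
  user_rating_ver = c(4.5, 4.5, 3.0, 3.0, 4.5, 2.5),
  stringsAsFactors = FALSE
)

focus_apps <- toy_apps %>%
  filter(prime_genre %in% c("Music", "Social Networking"))

# Highest rated app(s) in each genre (ties are all kept).
top_rated <- focus_apps %>%
  group_by(prime_genre) %>%
  filter(user_rating == max(user_rating)) %>%
  ungroup() %>%
  select(prime_genre, track_name, user_rating)

# Mean rating of free vs paid apps in each genre.
free_vs_paid <- focus_apps %>%
  mutate(is_free = price == 0) %>%
  group_by(prime_genre, is_free) %>%
  summarise(mean_rating = mean(user_rating), num_apps = n()) %>%
  ungroup()

# One linear model per genre: predict the overall rating from the current-version rating.
models <- lapply(split(focus_apps, focus_apps$prime_genre),
                 function(d) lm(user_rating ~ user_rating_ver, data = d))

# Predicted overall rating for a hypothetical app whose current version is rated 4.
new_app <- data.frame(user_rating_ver = 4)
predicted <- sapply(models, function(m) unname(predict(m, newdata = new_app)))

top_rated
free_vs_paid
predicted
```

On the real data, a train/test split would be needed before trusting the predictions; the sketch only shows the shape of each step.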